Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Storage cleaner] Add wandb path implementation #400

Merged
merged 13 commits into from
Dec 15, 2023

Conversation

2015aroras
Copy link
Collaborator

@2015aroras 2015aroras commented Dec 14, 2023

This PR adds a basic implementation for getting a wandb path for a given run directory. This may have issues in some common cases:

  • If the run used save_overwrite and so the top-level config.yaml does not match all checkpoints, then the resulting wandb path won't be correct for all checkpoints of the "run". Dealing with this programmatically seems more messy than is worth.
  • If run does not have a top-level config.yaml, then the script exits with error. Using a checkpoint's config.yaml could lead to the previous scenario.
  • If the run has been deleted from wandb, then the script exits with error. We can probably deal with this manually?
  • If the run has under-specified wandb information in the config file. One can probably deduce the right run from other information, but I don't think this is worth it presently.

I am hoping that most of the runs worth keeping do not face the above issues. I can improve how the storage cleaner deals with these scenarios if we believe they are more important than I had expected.

Copy link
Member

@epwalsh epwalsh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think resolving the W&B run based on the W&B run name is going to work well in practice, at least for my runs where I typically use the same W&B name for each restart.

Alternatively:

  • Use the existing name/group filters to get a set of potential matches
  • Then filter those matches by comparing all fields in the config

This should work since we save the train config to the W&B run config.

scripts/storage_cleaner.py Outdated Show resolved Hide resolved
@2015aroras
Copy link
Collaborator Author

I don't think resolving the W&B run based on the W&B run name is going to work well in practice, at least for my runs where I typically use the same W&B name for each restart.

Alternatively:

  • Use the existing name/group filters to get a set of potential matches
  • Then filter those matches by comparing all fields in the config

This should work since we save the train config to the W&B run config.

This would help filter out some runs but I assume that you sometimes use the exact same config for several runs (I have in the past)? Are you concerned about that case too?

@epwalsh
Copy link
Member

epwalsh commented Dec 14, 2023

This would help filter out some runs but I assume that you sometimes use the exact same config for several runs (I have in the past)? Are you concerned about that case too?

I probably have some runs like this due to failures, but in general I try to remove failed runs immediately from W&B. When that happens with this script we could print out links to each of those runs so the user could check to see if some of duplicates can be removed from W&B.

@2015aroras
Copy link
Collaborator Author

I don't think resolving the W&B run based on the W&B run name is going to work well in practice, at least for my runs where I typically use the same W&B name for each restart.

Alternatively:

  • Use the existing name/group filters to get a set of potential matches
  • Then filter those matches by comparing all fields in the config

This should work since we save the train config to the W&B run config.

223eb6a Filtering by config took ~90 lines.

@epwalsh
Copy link
Member

epwalsh commented Dec 15, 2023

Hmm you might be able to simplify that a bit by trying to load the W&B config into a TrainConfig?

@2015aroras
Copy link
Collaborator Author

Hmm you might be able to simplify that a bit by trying to load the W&B config into a TrainConfig?

b4ac61e Somehow I didn't even think about this. It looks like it works fine.

@2015aroras 2015aroras merged commit 685d11b into main Dec 15, 2023
10 checks passed
@2015aroras 2015aroras deleted the shanea/storage-cleaner-wandb branch December 15, 2023 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants